Matthew Talluto
13.02.2022
R currently has two dominant graphical engines
We will mostly use base graphics in the course
If you want to learn ggplot, ask Matt
The `hist` defaults are not very nice, so let's improve things:

- `main = ""` disables the title
- `xlab` and `ylab` control the axis labels
- `breaks` controls the number of bins in the histogram
- `col` sets the color of the bars
- `border` sets the color of the borders (`NA`: no border)
- colors can be hex strings of the form `#RRGGBB`, with each channel running from `00` (none) to `FF` (most)
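A sketch combining these options (`my_var` and the axis labels are stand-ins; the data here are simulated):

```r
# simulated data standing in for my_var (an assumption for the example)
set.seed(42)
my_var = rgamma(1000, shape = 4, scale = 2.5)

hist(my_var,
     main = "",            # no title
     xlab = "my_var",      # x axis label
     ylab = "Frequency",   # y axis label
     breaks = 20,          # suggest roughly 20 bins
     col = "#BC8F8F",      # bar color (the hex code for rosybrown)
     border = NA)          # no borders around the bars
```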
Named colors also work (e.g., `col = "rosybrown"`); see the `colors()` function for the names.

The population mean (\(\mu\)) can be approximated with the sample mean:
\[ \mu \approx \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \]
The mean can be strongly influenced by outliers.
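A quick illustration with made-up numbers: one extreme value pulls the mean strongly, while the median barely moves.

```r
x = c(4, 5, 5, 6, 7)
mean(x)      # 5.4
median(x)    # 5

# add a single outlier
x_out = c(x, 100)
mean(x_out)    # ~21.2: pulled far toward the outlier
median(x_out)  # 5.5: nearly unchanged
```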
# mode
mode(my_var) # wrong! this returns the storage mode, not the statistical mode
## [1] "numeric"
# we can use the histogram function to approximate the sample mode
# changing the number of breaks will have a large impact on the results
my_hist = hist(my_var, breaks = 30, plot = FALSE)
# the mids variable gives you the midpoint of each bin
# counts gives you the count each bin
# cbind shows them together in columns
head(cbind(my_hist$mids, my_hist$counts))
## [,1] [,2]
## [1,] 3 35
## [2,] 5 107
## [3,] 7 161
## [4,] 9 175
## [5,] 11 135
## [6,] 13 130
# use this to get the mode
# first find out which one is the tallest with which.max
(mode_position = which.max(my_hist$counts))
## [1] 4
my_hist$mids[mode_position]
## [1] 9

We can compare variables in a way that is location independent by centering (subtracting the mean).

The spread of a variable around its mean is measured by the population variance:
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i-\mu)^2 \]
We can estimate \(\sigma^2\) using the sample variance:
\[ \sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2 \]
It is convenient to talk about the scale of \(x\) in the same units as \(x\) itself, so we use the (population or sample) standard deviation:
\[ \sigma = \sqrt{\sigma^2} \approx s = \sqrt{s^2} \]
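These quantities map directly onto the `var` and `sd` functions, which use the sample (\(n-1\)) versions; a quick check with made-up numbers:

```r
x = c(2, 4, 4, 4, 5, 5, 7, 9)

# sample variance: sum of squared deviations over n - 1
var(x)
sum((x - mean(x))^2) / (length(x) - 1)  # identical

# the standard deviation is the square root of the variance
sd(x)
sqrt(var(x))  # identical
```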
The statistical range is `max(x) - min(x)`.

# the range function gives you the min and max
# Take the difference to get the statistical range
diff(range(my_var))
## [1] 50.4
max(my_var) - min(my_var)
## [1] 50.4
# the interquartile range: the difference between the 0.75 and 0.25 quantiles
IQR(my_var)
## [1] 6.69
quantile(my_var, 0.75) - quantile(my_var, 0.25)
## 75%
## 6.69
# coefficient of variation can be done manually
sd(my_var)/mean(my_var)
## [1] 0.531

Is the distribution weighted to one side or the other?
\[ \mathrm{skewness} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)s^3} \]
How fat are the tails relative to a normal distribution?
\[ \mathrm{kurtosis} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)s^4} \]
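Base R has no built-in functions for these, but the formulas are easy to code directly (the function names `skew` and `kurt` are my own):

```r
# sample skewness, following the formula above
skew = function(x) sum((x - mean(x))^3) / ((length(x) - 1) * sd(x)^3)
# sample kurtosis, following the formula above
kurt = function(x) sum((x - mean(x))^4) / ((length(x) - 1) * sd(x)^4)

set.seed(1)
y = rexp(1000)  # exponential samples have a long right tail
skew(y)         # positive, indicating right skew
kurt(y)         # well above 3, the kurtosis of a normal distribution
```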
We can compare groups with the `boxplot` function; `y ~ group` is a special data type called a formula.
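For example, with the penguins data (assuming the `palmerpenguins` package is installed):

```r
library(palmerpenguins)

# left of the ~: the variable to summarize; right of the ~: the grouping variable
boxplot(bill_length_mm ~ species, data = penguins,
        xlab = "Species", ylab = "Bill length (mm)")
```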
We can apply `mean` separately to every variable in the `penguins` dataset using simple apply (`sapply`).

# bad!
(col_means = c(
mean(penguins[,3]),
mean(penguins[,4]),
mean(penguins[,5]),
mean(penguins[,3]) # it's easy to introduce mistakes this way!
))
## [1] 44.0 17.2 201.0 44.0
# better
# use columns 3:6 because these are the numeric columns
# doesn't make sense to take the mean of penguins$species
(col_means = sapply(penguins[,3:6], mean)) # apply the function mean to each variable
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 44.0 17.2 201.0 4207.1

sapply(penguins[,3:6], range)
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,] 32.1 13.1 172 2700
## [2,] 59.6 21.5 231 6300
# here, we pass an additional argument named probs to quantile
# see ?quantile for what this does
sapply(penguins[,3:6], quantile, probs = c(0.25, 0.5, 0.75))
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 25% 39.5 15.6 190 3550
## 50% 44.5 17.3 197 4050
## 75% 48.6 18.7 213 4775

tapply (tabular apply) applies a function to one variable (e.g., `bill_length_mm`) based on categories in another variable (e.g., `species`).
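A minimal sketch (again assuming `palmerpenguins` is available; `na.rm = TRUE` is needed because the data contain missing values):

```r
library(palmerpenguins)

# mean bill length, computed separately for each species
tapply(penguins$bill_length_mm, penguins$species, mean, na.rm = TRUE)
```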
Thought experiment
You and 99 of your closest friends gather on the centre line of a 100-m football pitch. We define the centre line as 0 m, the western boundary as -50 m, and the eastern boundary as +50 m.
Each person flips a fair coin. If the coin is heads, they take a step east (add 0.5 m to their location); if it's tails, they take a step west (subtract 0.5 m from their location).
Question: What is the long-run distribution of positions on the field?
(Figures: the positions of all 100 walkers, shown at t = 10, t = 50, and t = 500.)
We can simulate this experiment in R. Here is code for doing it for one person:
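A minimal sketch for one person (variable names are my own):

```r
time_steps = 1:500
position = 0                          # start at the centre line (0 m)
track = numeric(length(time_steps))   # record the position at each step

for(i in time_steps) {
    # flip a fair coin
    flip = sample(c("heads", "tails"), size = 1)
    # step east (+0.5 m) on heads, west (-0.5 m) on tails
    position = position + ifelse(flip == "heads", 0.5, -0.5)
    track[i] = position
}

plot(time_steps, track, type = "l", xlab = "Time step", ylab = "Position (m)")
```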
And for a larger sample (2500 friends), doing more coin flips (500 each), with more concise code:
# repeat the initial position 2500 times, one per friend
n_friends = 2500
positions = rep(0, n_friends)
# repeat the experiment 500 times
time_steps = 1:500
for(i in time_steps) {
    # flip a coin for each friend
    # returns one heads or tails for each of the n_friends
    coin_flips = sample(c("heads", "tails"), size = n_friends, replace = TRUE)
    # compute the steps: +0.5 if the flip is heads, -0.5 if tails
    moves = ifelse(coin_flips == "heads", 0.5, -0.5)
    # finally add the moves to the old positions and save the new positions
    positions = positions + moves
}

We can use the `density` function in R to add a curve approximating this density.

hist(positions, breaks=40, col="gray", main = "", freq=FALSE)
lines(density(positions, adjust=1.5), col='red', lwd=2)
mu = mean(positions)
sig = sd(positions)
x_norm = seq(min(positions), max(positions), length.out = 400)
y_norm = dnorm(x_norm, mu, sig)
lines(x_norm, y_norm, lwd=2, col='blue')
legend("topright", legend=c(paste("sample mean =", round(mu, 2)),
paste("sample sd =", round(sig, 2))), lwd=0, bty='n')

Variables can be standardized with the `scale` function, which centers (subtracts the mean) and scales (divides by the standard deviation).

The normal probability density function (PDF):

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left (\frac{x-\mu}{\sigma} \right )^2} \]
The cumulative distribution function (CDF):

\[ g(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left (\frac{t-\mu}{\sigma} \right )^2} dt \]
- PDF (`dnorm`): what is the probability density when \(x=3\) (the height of the bell curve)?
- CDF (`pnorm`): what is the cumulative probability when \(x=q\)? (the area under the bell curve from \(-\infty\) to \(q\); the probability of observing a value \(< q\))
- Quantiles (`qnorm`): what is the value of \(x\) such that the probability of observing \(x\) or smaller is \(p\)? (the inverse of the CDF)
- RNG (`rnorm`): random number generator; produces \(n\) random numbers from the desired distribution
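For the normal distribution, these four operations are `dnorm`, `pnorm`, `qnorm`, and `rnorm`; a quick sketch with the standard normal:

```r
dnorm(3, mean = 0, sd = 1)     # density (curve height) at x = 3
pnorm(1.96, mean = 0, sd = 1)  # P(X < 1.96), about 0.975
qnorm(0.975, mean = 0, sd = 1) # the quantile: about 1.96

# qnorm is the inverse of pnorm
qnorm(pnorm(1.2))  # 1.2
```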
rnorm(n = 10, mean = 0, sd = 1)
## [1] 0.606 -1.230 -1.041 0.362 1.731 -0.855 -0.579 1.140 0.852 0.711

R supports many distributions; we will discuss others as they come up.
The standard error of the mean describes how precisely the sample mean estimates \(\mu\):

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}} \]
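A quick simulation check (the sample size and repetition count are arbitrary choices): the standard deviation of many sample means closely matches \(\sigma / \sqrt{n}\).

```r
set.seed(123)
n = 25

# draw 10000 samples of size n and record each sample mean
sample_means = replicate(10000, mean(rnorm(n, mean = 0, sd = 2)))

sd(sample_means)  # simulated standard error, close to...
2 / sqrt(n)       # ...the theoretical value, 0.4
```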